De novo transcriptome assembly and gene annotation for the toxic dinoflagellate Dinophysis

Species within the dinoflagellate genus Dinophysis can produce okadiac acid and dinophysistoxins leading to diarrhetic shellfish poisoning. Since the first report of D. ovum from the Gulf of Mexico in 2008, reports of other Dinophysis species across US have increased. Members of the D. cf. acuminata complex (D. acuminata, D. acuta, D. ovum, D. sacculus) are difficult to differentiate due to their morphological similarities. Dinophysis feeds on and steals the chloroplasts from the ciliate, Mesodinium rubrum, which in turn has fed on and captured the chloroplasts of its prey, the cryptophyte Teleaulax amphioxeia. The objective of this study was to generate de novo transcriptomes for new isolates of these mixotrophic organisms. The transcriptomes obtained will serve as a reference for future experiments to assess the effect of different abiotic and biotic conditions and will also provide a useful resource for screening potential marker genes to differentiate among the closely related species within the D. cf. acuminata-complex. The complete comprehensive detailed workflow and links to obtain the transcriptome data are provided.


Background & Summary
Diarrhetic Shellfish Poisoning (DSP) is a human illness caused by consumption of shellfish contaminated with okadaic acid and/or dinophysistoxins. The organisms responsible for producing these toxins include species in the marine dinoflagellate genus Dinophysis. Although a total of 137 Dinophysis species are taxonomically accepted, only 10 are known to produce DSP when humans consume filter-feeding shellfish that have concentrated these species 1,2 . An unusual feature of Dinophysis is that they are mixotrophic-that is, they rely on both photosynthesis and prey capture. They accomplish this by feeding on and stealing the chloroplasts from the ciliate, Mesodinium rubrum, which in turn has fed on and captured the chloroplasts of its prey, the cryptophyte Teleaulax amphioxeia. Many single-celled plankton are now recognized as mixotrophs 3 .
Until recently, DSP-related shellfish closures were reported primarily in Asian and European waters. The first incidence of Dinophysis occurrence at bloom levels in US was reported in 2008 for the Texas coast and lead to the closure of shellfish harvesting 4,5 . In the past decade, Dinophysis blooms have increased in frequency nationwide, so all coasts in the US now face closures of shellfish industries, but each event is linked to a different Dinophysis species. In the Gulf of Mexico, DSP and shellfish closures have been attributed to D. ovum 4 . Shellfish harvesting closures have been linked to blooms of D. acuminata and D. fortii in Puget Sound, WA 6 , to D. acuminata in Massachusetts 7 , and to D. norvegica in Maine 8 . Multiple species of toxigenic Dinophysis are present in the Chesapeake Bay 9 . Because of the morphological and genetic similarity of D. acuminata and D. ovum, counts of these two-along with D. sacculus and D. acuta-are often lumped together as "D. cf. acuminata-complex" in monitoring programs utilizing light microscopy 9 . Recent studies, however, have shown that D. acuminata and D. ovum have unique toxin profiles 10 . The diversity of Dinophysis species and toxigenicity in different regions of the US suggests that effective management will require examination of the environmental factors that influence their growth.
The focus of this study was to develop reference transcriptomes for each component of this unique "food chain" (Fig. 1a). Although results for members of the Dinophysis food chain have been reported previously [11][12][13] , our focus was on two new isolates of Dinophysis (D. acuminata from the Chesapeake Bay, D. ovum from the Gulf of Mexico) and additional strains of Mesodinium rubrum and Teleaulax amphioxeia ( Table 1). The use of multiple strains of a single harmful algal species has been recommended to address the physiological variability  (Tables 2 and 3). The different sequencing depth between D. acuminata and D. ovum may explain the larger number of transcripts discovered for D. ovum. A reciprocal BLAST between the two Dinophysis species and clustering at 95% similarity yielded a total of 85,968 shared transcripts (Fig. 1b). The number of transcripts shared between the prey item M. rubrum-DK2009 and D. acuminata was 350 compared to 6,759 with D. ovum (Fig. 1b). These low numbers were expected because cultures of Dinophysis were extracted for analysis after all prey were depleted. Additionally, the number of transcripts shared between M. rubrum-JAMR and D. acuminata was 5,221 compared to 7,503 with D. ovum. A total of 54,540 transcripts were shared between M. rubrum-DK2009 and its prey, T. amphioxeia (Fig. 1b), and 49,297 between M. rubrum-JAMR and T. amphioxeia. The number of shared transcripts between the two M. rubrum strains DK2009 and JAMR was 43,115.
The assembled de novo transcriptomes for D. acuminata and D. ovum will serve as a reference for future experiments to assess the effect of different abiotic and biotic conditions and will also provide a useful resource for screening potential genes of interest to differentiate among the closely related species within the D. cf. acuminata-complex. The generated de novo transcriptomes for this collection of mixotrophic organisms will be a valuable resource for further downstream bioinformatics applications, including validation of gene expression, quantitative RNA-Seq analysis and comparative transcriptomics among strains of these harmful algal bloom species 14 .

Methods
Cell culturing and collection. Cultures of the kleptoplastic, mixotrophic species of Dinophysis, D. acuminata and D. ovum, the prey ciliate Mesodinium rubrum, and its prey, the cryptophyte Teleaulax amphioxeia (Table 1), were grown following the method described in Fiorendino et al. (10). Briefly, cultures were grown in L1-Si seawater medium 15 at a salinity of 22, 18 °C, and under 100 µmol quanta m −2 s −1 on a 14: 10 light: dark cycle. Cultures were harvested by centrifugation at 3000 g for 15 mins. The cryptophyte T. amphioxeia was harvested at mid-exponential stage (~day 6). The M. rubrum and Dinophysis cultures were fed their respective prey at a 1:10 (predator: prey) ratio and harvested after the complete consumption of their cryptophyte or ciliate prey, respectively.
RNA Extraction and sequencing. Total RNA was extracted from cell pellets using Extracta Plus RNA (QuantaBio, USA). Total RNA extraction was performed following the manufacturer′s guide. RNA concentration was measured using a Qubit RNA HS Assay kit (ThermoFisher Scientific, USA), and RNA integrity was evaluated using Agilent Fragment analyzer system (Agilent, USA).
Poly-A selected RNA libraries were prepared using the NEXTFLEX Rapid Directional RNA-seq kit 2.0 (Perkin Elmer, Waltham, MA) as per the manufacturer′s instructions. Each library was prepared with a unique barcode and pooled at equimolar concentrations. The pooled samples were sequenced on an Illumina NextSeq. 500 (Illumina, San Diego, CA) at a read length of 2 × 150 bp, targeting 60 million read pairs per sample.
De novo assembly and gene annotation. High quality RNA-Seq reads (sequences) were used to generate the de novo transcriptome assemblies using the bioinformatics tools illustrated in Fig. 2. Raw sequence reads in fastq format were processed to remove adapters, poly-N (⩾10% read length), low-quality bases (Phred score < 10) and the last 10 bases were trimmed using the bbduk function in BBMap tool v. 38.90 (https:// Fig. 1 (a) The food chain supporting the mixotrophic dinoflagellate Dinophysis, includes Mesodinium rubrum and Teleaulax amphioxeia 39 . Images from the Imaging FlowCytobot in the Gulf of Mexico, Texas coast 4 . Scale bar = 10 μm. (b) Venn diagrams showing the unique transcripts for each organism, with the shared transcripts shown in the overlapping areas. Note that the larger number of transcripts discovered for D. ovum was due to the higher sequencing depth, so the number of shared transcripts between D. ovum and M. rubrum also was higher.
www.nature.com/scientificdata www.nature.com/scientificdata/ sourceforge.net/projects/bbmap/). Reads shorter than 125 bp were also discarded. Forward and reverse reads were concatenated using the bbrepair function. Non-mRNA reads were removed using SortMeRNA v. 4.3.4 with rRNA databases as reference 16 . The mRNA reads were normalized for depth based on kmer counts using the BBNorm function. Summary statistics for the number of total reads before and after precleaning are presented in Table 2. De novo transcriptomes were generated using Trinity v. 2.12.0 17 with default settings and Velvet-master v. 1.2.10 18 -Oases-master v. 0.2.09 19 with default settings, except for minimum length criterion set as 300 bp for the shortest transcripts. Both de novo transcriptomes were merged using cd-hit-est v. 4.8.1 20 to reduce the transcript redundancy by 98% similarity and generate unique gene clusters. TransDecoder (https://github.com/ TransDecoder/TransDecoder) was used to identify coding regions (ORF) of the assembled transcripts. The generated de novo assemblies were functionally annotated using the NCBI non-reductant protein database (NR) using BLAST tool v. 2.110 21 . InterProScan v. 5.55-88.0 22 was used to identify potential proteins in pathways using the Pfam, PANTHER, Gene3D, SUPERFAMILY, TIGRFAM, HAMAP, SFLD, PRINTS datasets.  [23][24][25][26][27] and the Transcriptome Shotgun Assembly (TSA) at DDBJ/ENA/GenBank [28][29][30][31][32] . Annotated transcript datasets are deposited in Zenodo 33-37 .

Fig. 2
The bioinformatics tools used for assembly of the non-model organisms Dinophysis, Mesodinium, and Teleaulax. Quality trimming and filtering were accomplished with BBmap (https://sourceforge.net/projects/ bbmap/) and SortMeRNA 16 , followed by normalization with the BBnorm function and interleaving the forward reads (fwd) and reverse reads (rev) using the BBrepair function in the BBMap package. Assemblies were generated with Trinity 17 and Velvet-Oases 18,19 and merged with cd-hit-est at 98% 20 . Open reading frames of coding regions were identified using TransDecoder (https://github.com/TransDecoder/TransDecoder) and functional annotation of the resulting transcripts was performed using BLAST 21 against the NCBI NR database and predicted pathways were identified using InterProScan 22 . www.nature.com/scientificdata www.nature.com/scientificdata/

technical Validation
After the initial FastQC check and precleaning steps, we assembled the de novo transcriptome assemblies with Trinity 17 and Velvet-Oases 18,19 (Table 3). We found that Trinity and Velvet-Oases produced different numbers of transcripts. The number of transcripts generated by Trinity was twice the number of transcripts from Velvet-Oases. The Trinity-Velvet-Oases merged strategy resulted in longer transcripts. Transcriptome assembly validation was done using Benchmarking Universal Single-Copy Orthologs (BUSCO) v. 4.1.4 38 . BUSCO core genes provide a qualitative estimate of the de novo transcriptome quality and completeness based on the evolutionarily informed expectation of the gene content from the near-universally conserved eukaryotic protein database (eukaryote_odb90). All five de novo transcriptome assemblies indicated high-quality assemblies with BUSCO coverage of 60-89% (Table 3). The CoDing sequences (CDS) obtained using TransDecoder revealed the highest number of genes in D. ovum (DoSS3195) while M. rubrum (DK2009) had the lowest number of genes (Table 3). N50 statistics appropriate for the de novo transcriptome assemblies were generated using the Trinity accessory scripts (Table 3). Functional annotation for these genes was performed using BLASTp with the maximum 3 best hits per gene and an e-value cutoff of 1e-20. The number of annotated genes ranged from 55-82% of the total transcripts (Table 3).
Using the bioinformatics tools illustrated in Fig. 2, the total number of transcripts for D. ovum exceeded the number for D. acuminata; this was probably due to the greater sequencing depth for D. ovum ( Table 2). Note that although the number of transcripts in this analysis exceeded a previous report for M. rubum 12 , likely because of the increased depth of sequencing here, it is less than the number of transcripts identified by others 13 . To determine the number of transcripts shared between the two Dinophysis species, a reciprocal BLAST was performed and results clustered at 95% similarity (Fig. 1b). Table 2. Summary of RNA-seq results and number of reads after quality trimming, after removal of non-mRNA, and the final sequence reads used for assembly after normalization.  Table 3. Properties of the transcriptome assemblies.